Welcome to the final article in our code optimization trilogy!
In the first article, we established the principles and decision framework for code optimization. In the second article, we explored how programming language choices, code design, and algorithms shape efficiency at a high level.
Now we’re diving into the most technical aspect: hardware utilization. This is where we learn to leverage our computational resources—CPUs, memory, and storage—to their fullest potential. We’ll cover vectorization, parallelization, and memory management techniques that can dramatically improve performance while keeping code maintainable.
Let’s get into it!
Hardware utilization refers to how code leverages computational resources. For example, vectorization and parallelization help us squeeze every last drop of juice from our CPUs, while in-place modification, object size pre-allocation, and on-demand data access are useful to manage memory usage.
Vectorization refers to the application of an operation to multiple elements simultaneously.
At the hardware level, vectorization is enabled by an architectural feature known as Single Instruction Multiple Data (SIMD). SIMD operations can, for example, sum 16 pairs of vector elements simultaneously within a single core, offering substantial speed-ups. However, SIMD instructions are only directly accessible from compiled languages (C, C++, Fortran, etc.), via specific compiler optimizations.
At the software level, many languages implement vectorized semantics. Think of adding two vectors b and c with the expression a = b + c. This abstraction makes code concise, and can also unlock performance gains in different ways. In compiled languages like Fortran, such expressions are typically optimized for SIMD vectorization. In interpreted languages like R, many vectorized functions are backed by compiled code. For instance, primitives like + are implemented as fast C loops that may or may not be optimized for SIMD by the compiler (see the section “R side: how can R possibly use SIMD?” in this excellent StackOverflow answer for details). In contrast, matrix operations rely on blazing-fast matrix algebra backends such as BLAS and LAPACK, which explicitly exploit SIMD vectorization (and parallelization!).
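As a minimal sketch of this difference, here is the a = b + c idea written both as an interpreted loop and as a single vectorized expression (the vectors are named x and y to avoid shadowing base R's c()):

```r
# Vectorized semantics: R's `+` applies to whole vectors at once,
# dispatching to a single compiled C loop under the hood.
x <- runif(1e6)
y <- runif(1e6)

# The same operation as an interpreted loop, element by element.
add_loop <- function(x, y) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i] + y[i]
  out
}

a_loop <- add_loop(x, y)
a_vec  <- x + y                 # vectorized: concise AND fast

stopifnot(all.equal(a_loop, a_vec))

# system.time() makes the performance gap visible:
system.time(add_loop(x, y))
system.time(x + y)
```

Both versions return identical results; the vectorized one simply moves the loop from the interpreter into compiled code.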
Some functions offer vectorized semantics without performance gains. This is the case with R functions like apply(), lapply(), purrr::map(), and the like, which are essentially loops in a trenchcoat.
By combining SIMD vectorization for raw performance with semantics-level vectorization for expressiveness, we maximize hardware utilization while keeping our code clean and efficient.
Parallelization accelerates execution by spreading independent tasks across multiple cores.
At the software level, parallelization can be achieved by spawning multiple processes, each with its own memory space, or by having a single process spawn several threads, all of them sharing the same memory space.
Parallelization can be explicit or implicit.
Explicit parallelization requires the user to define how and where parallel tasks are executed. This approach offers fine control over execution but also demands more setup and understanding of parallel workflows. In R, parallelized loops written with the packages doParallel and foreach require defining a parallelization backend (a.k.a. a “cluster”), selecting a number of cores, and using a specific syntax (y <- foreach(...) %dopar% {...}). That’s pretty explicit if you ask me! Modern alternatives like future and future.apply achieve the same results with less involved code.
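The same explicit pattern can be sketched with base R's parallel package (bundled with every R installation), which underlies many of these frameworks: the user chooses the workers and assigns them the tasks.

```r
# Explicit parallelization: WE choose the number of workers and
# which independent tasks they execute.
library(parallel)

slow_square <- function(x) {
  Sys.sleep(0.01)  # stand-in for a slow, independent task
  x^2
}

n_workers <- max(1L, detectCores() - 1L, na.rm = TRUE)  # leave one core free
cl <- makeCluster(n_workers)              # the "cluster" backend

res <- parLapply(cl, 1:20, slow_square)   # distribute the tasks

stopCluster(cl)                           # always release the workers

stopifnot(all.equal(unlist(res), (1:20)^2))
```

Note the boilerplate: creating the cluster, dispatching the work, and shutting the workers down are all our responsibility, which is exactly what makes this approach "explicit".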
On the other hand, implicit parallelization happens without user intervention or even knowledge. For example, the packages arrow and data.table apply multithreading to parallelize many data operations. This is also the case for matrix operations in R (e.g. GAM fitting with mgcv::gam()), which are multithreaded by matrix algebra libraries such as BLAS and Intel MKL.
The CRAN Task View: High-Performance and Parallel Computing with R offers a more complete overview of the different parallelization options available in R.
In any case, parallelization has several requirements: the tasks must be independent of each other, and the gains must outweigh the overhead of spawning and coordinating the workers.
Even under ideal conditions, parallelization has well-known diminishing returns, formalized in Amdahl’s Law: the serial fraction of a program caps the achievable speed-up, so beyond some point we cannot simply throw more processors at our code and expect proportional efficiency gains.
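Amdahl's Law can be written as S(n) = 1 / ((1 − p) + p/n), where p is the parallelizable fraction of the program and n the number of processors. A few lines of R make the ceiling obvious:

```r
# Amdahl's Law: speed-up with n processors when only a fraction p
# of the run time can be parallelized.
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(0.9, 2)    # ~1.82x with 2 cores
amdahl(0.9, 8)    # ~4.71x with 8 cores
amdahl(0.9, 64)   # ~8.77x with 64 cores
amdahl(0.9, Inf)  # 10x: the hard ceiling is 1 / (1 - p)
```

Even with 90% of the program parallelized, infinite cores buy at most a 10x speed-up; the remaining 10% of serial work dominates.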
Let’s jump into what’s IMHO the most interesting topic of this article: Memory Management!
Computers have a short-term memory directly connected to the processor, known as main memory, system memory, or RAM. Any code and data required by a program lives (and sometimes dies) there
during run time. For example, when you start an R session, the operating
system assigns it a section of the system memory, and all functions of
the packages base, stats,
graphics, and a few others are read from disk and loaded
there. So does the code of any package you load using
library(), or the data your program reads from disk, or any
results it generates via models or other computations.
Main memory is FAST, but FINITE! If a program requires more memory than available, the operating system may start moving parts of the main memory to the hard disk (see memory paging and swap file), slowing things down. In extreme cases, a program can run out of memory and crash.
Also, a program repeatedly allocating and deallocating memory chunks of varying sizes usually accumulates non-contiguous free gaps between used memory blocks that are hard to re-allocate. This issue, known as memory fragmentation, leads to performance slowdowns and a higher memory usage that can end with a crash.
Efficient memory management can help avoid these issues by ensuring that our code uses the system’s memory in a sensible manner.
Being memory aware is a first step in the right direction. It sounds almost too simple, but keeping a memory monitor like htop and the like open during code development and testing helps build an intuition of how our program uses memory.
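Alongside an external monitor, R itself can report memory use; here is a quick sketch with base functions only:

```r
# Inspect memory use from inside the R session.
x <- matrix(runif(1e6), nrow = 1000)   # ~8 MB of doubles

print(object.size(x), units = "MB")    # size of one specific object

gc()             # report (and trigger) garbage collection; the "used"
                 # columns show the session's current memory footprint
rm(x)            # drop the only reference to the matrix...
invisible(gc())  # ...so the collector can reclaim its memory
```

Calling gc() explicitly is rarely needed for performance, but its report is a handy, dependency-free memory gauge.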
Other good techniques we can apply to consistently improve memory management in R are in-place modification, pre-allocating object size, and on-demand data access.
In-place modification, also known as modification by reference, refers to object modification without duplication (see copy-on-modify in R). This is probably the most consistent strategy we can apply to manage memory in R! Section 2.5 of the book Advanced R covers the technical details, and offers great advice: “We can reduce the number of copies by using a list instead of a data frame. Modifying a list uses internal C code, so … no copy is made.” If data frames are your jam, then the package data.table may come as a life-saver, as it has an innate ability to modify large data frames in place, making it fast and efficient.
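Copy-on-modify is easy to see with base R's tracemem(), which prints a message whenever the traced object is duplicated (a small sketch; it requires an R build with memory profiling, which is the default, and the exact messages vary across versions):

```r
# Watch copy-on-modify happen with tracemem().
x <- c(1, 2, 3)
tracemem(x)      # start watching x for duplications

y <- x           # no copy yet: x and y point to the same memory
y[1] <- 99       # copy-on-modify: R duplicates the vector here,
                 # and tracemem prints the duplication
untracemem(x)

stopifnot(identical(x, c(1, 2, 3)))   # x is untouched...
stopifnot(identical(y, c(99, 2, 3)))  # ...only the copy changed
```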
Growing data frames, vectors, or matrices in a loop triggers the copy-on-modify behavior of R and makes things very slow. This happens because R has to reallocate memory on each iteration for the object’s copy, which takes time and increases memory usage. But if growing something is unavoidable, either pre-allocate the object size, or better, grow a list, as lists are dynamically allocated (rather than pre-allocated) and don’t require their elements to be stored in contiguous memory regions. In any case, when in doubt, apply benchmarking to identify the most efficient method.
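A minimal benchmark of the three approaches (growing, pre-allocating, and fully vectorized) makes the cost of growing concrete:

```r
# Build n squares three ways, from worst to best.

grow <- function(n) {
  out <- numeric(0)
  for (i in 1:n) out <- c(out, i^2)  # reallocates and copies each time
  out
}

prealloc <- function(n) {
  out <- numeric(n)                  # allocate the final size once
  for (i in 1:n) out[i] <- i^2       # then fill in place
  out
}

vectorized <- function(n) (1:n)^2    # one call into compiled code

n <- 1e4
stopifnot(all.equal(grow(n), prealloc(n)),
          all.equal(prealloc(n), vectorized(n)))

system.time(grow(n))        # slowest: quadratic copying
system.time(prealloc(n))    # much faster
system.time(vectorized(n))  # fastest
```

All three return the same vector; the difference is purely in how many times memory gets reallocated along the way.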
On-demand data access refers to several data handling strategies to work with data larger than memory.
Memory-mapped files are representations of large on-disk data in the virtual memory of the operating system. The operating system directly handles the on-demand reading and caching of specific portions of these files, which reduces memory overhead at the expense of increased disk reads (having an efficient SSD is a game changer here!) and computation time. In R, the packages mmap and ff (see brief tutorial here) offer low-level memory-mapping implementations, while the bigmemory package focuses on large matrices.
Chunk-wise processing involves explicitly dividing large data into smaller and more manageable pieces, making it a flexible solution for handling large-scale computations efficiently. For example, the package terra combines this technique with lazy evaluation when working with large raster files to control memory usage.
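The idea can be sketched with base R connections alone: sum a column of a CSV without ever holding the whole file in memory (the file and chunk size are made up for the example):

```r
# Chunk-wise processing: aggregate a CSV 1000 rows at a time.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5000), path, row.names = FALSE)

con <- file(path, open = "r")
invisible(readLines(con, n = 1))     # skip the header line

total <- 0
repeat {
  chunk <- readLines(con, n = 1000)  # read the next chunk of rows
  if (length(chunk) == 0) break      # stop at end of file
  total <- total + sum(as.numeric(chunk))
}
close(con)

stopifnot(total == sum(1:5000))      # matches the in-memory answer
```

Peak memory stays bounded by the chunk size, no matter how large the file grows; packages like terra apply the same principle automatically.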
Modern data solutions like Apache Arrow and DuckDB provide efficient columnar storage
and query capabilities. The arrow package enables efficient
on-demand access and streaming reads with
arrow::open_dataset(), while DuckDB brings SQL-powered
processing to R with support for lazy evaluation and filtering only
relevant data subsets.
The package targets combines chunk-wise processing, parallelization, and multisession execution seamlessly via dynamic branching.
Memory management in R is a deep rabbit hole, but there are several great resources out there that may help you find your footing on this topic:
The hardware utilization techniques we’ve covered impact several key performance dimensions:
Execution time: The time required to run the code. Vectorization and parallelization are your primary tools for improving execution speed, but they work best when combined with efficient algorithms and data structures.
Memory usage: Peak memory usage during run time. Techniques like in-place modification, pre-allocation, and on-demand data access help keep memory usage under control. Remember: more speed often means more memory, so there’s always a trade-off.
I/O efficiency: How well the code handles file access, network usage, and database queries. Memory-mapped files and modern tools like Arrow and DuckDB can dramatically improve I/O efficiency for large datasets.
Scalability: How well the code adapts to increasing workloads and larger infrastructures. Good scalability means your code performs well whether processing 1,000 rows or 1 billion rows. Understanding concepts like the Universal Scalability Law can help you predict and improve how your code scales.
Energy efficiency: The trade-off between computational cost and energy consumption. More efficient code doesn’t just save time and money, it also reduces environmental impact. Remember: computing has a serious environmental footprint.
Code optimization is fundamentally about making intelligent trade-offs. Throughout this trilogy, we’ve seen that:
The most efficient code is code that:
- Solves the problem correctly
- Is readable and maintainable
- Performs well enough for its use case
- Makes thoughtful use of available resources
Remember the three commandments:
1. Thou shalt not optimize thy code (unless necessary)
2. Thou shalt make thy code simple
3. Thou shalt optimize wisely
Optimization is an iterative process. Start with clean, correct code. Profile to find bottlenecks. Apply targeted optimizations. Measure the impact. And most importantly, know when to stop.
The goal isn’t perfection—it’s creating software that’s efficient for both humans and machines, maintainable over time, and sustainable in its resource usage.
Thank you for joining me on this journey through code optimization. May your code be fast, your memory usage reasonable, and your bugs few!
For those who want to dive deeper, here are some excellent resources: